Introduction

The data set we chose comes from the National Center for Education Statistics (NCES) of the Institute of Education Sciences (IES). The study is titled the Education Longitudinal Study of 2002 (ELS:2002). ELS:2002 is a major longitudinal effort designed to provide trend data about critical transitions experienced by students as they proceed through high school and into postsecondary education. The 2002 sophomore cohort was followed, initially at 2-year intervals, to collect policy-relevant data about educational processes and outcomes. These data focus on student learning, student trajectories, student persistence and access to college, and entry into the workforce. The base year of the study was the spring term of 2002, when a national sample of high school sophomores was surveyed along with their parents, teachers, administrators, and librarians. The first follow-up took place two years later, in 2004. Students who remained in the same school in both 2002 and 2004 were resurveyed and tested in multiple areas, including mathematics. Students who no longer attended the same school for any reason (e.g., transfer, dropout, early graduation) were administered a questionnaire. Demographic information was collected for both groups along with academic scores.

Research Question

Do students’ standardized math test scores at the two time points (i.e., base year and first follow-up) vary with 1) students’ sex, 2) students’ race/ethnicity, and 3) the highest education level of students’ mothers?

Data Information

Variables

Independent Variables:

1. [BYSEX]

1 = “Male”

2 = “Female”

-4 = “Nonrespondent”

-8 = “Survey component legitimate skip/NA”

2. [BYRACE]

1 = “Amer. Indian/Alaska Native, non-Hispanic”

2 = “Asian, Hawaii/Pac. Islander,non-Hispanic”

3 = “Black or African American, non-Hispanic”

4 = “Hispanic, no race specified”

5 = “Hispanic, race specified”

6 = “More than one race, non-Hispanic”

7 = “White, non-Hispanic”

-4 = “Nonrespondent”

-8 = “Survey component legitimate skip/NA”

3. [BYMOTHED]

1 = “Did not finish high school”

2 = “Graduated from high school or GED”

3 = “Attended 2-year school, no degree”

4 = “Graduated from 2-year school”

5 = “Attended college, no 4-year degree”

6 = “Graduated from college”

7 = “Completed Master’s degree or equivalent”

8 = “Completed PhD, MD, other advanced degree”

-4 = “Nonrespondent”

-8 = “Survey component legitimate skip/NA”

-9 = “Missing”

Dependent Variables:

1. [BYTXMSTD] Math test standardized score

Description: Math standardized T Score. The standardized T score provides a norm-referenced measurement of achievement, that is, an estimate of achievement relative to the population (spring 2002 10th-graders) as a whole. It provides information on status compared to peers (as distinguished from the IRT-estimated number-right score which represents status with respect to achievement on a particular criterion set of test items). The standardized T score is a transformation of the IRT theta (ability) estimate, rescaled to a mean of 50 and standard deviation of 10.

2. [F1TXMSTD] F1 math test standardized score

Description: Math standardized T Score. The standardized T score provides a norm-referenced measurement of achievement, that is, an estimate of achievement relative to the population (spring 2004 12th-graders) as a whole. It provides information on status compared with peers (as distinguished from the IRT-estimated number-right score which represents status with respect to achievement on a particular criterion set of test items). Although the T score is reported for all F1 in-school responding students (including transfer students), regardless of grade level, the comparison group for standardizing is the 12th grade population. The standardized T score is a transformation of the IRT theta (ability) estimate, and has a mean of 50 and standard deviation of 10 for the weighted subset of 12th-graders in the sample.
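The rescaling described above can be illustrated with a short sketch. The theta values below are made up for illustration, and the rescaling is unweighted, whereas the published scores are standardized against the weighted sample, so this only shows the general idea of a T score.

```r
# Hypothetical IRT theta (ability) estimates; these are NOT ELS data.
theta <- c(-1.2, -0.4, 0.0, 0.3, 1.3)

# Rescale to mean 50 and SD 10.
# Assumption: unweighted mean and SD; the ELS scores use sample weights.
t_score <- 50 + 10 * (theta - mean(theta)) / sd(theta)

round(t_score, 1)
mean(t_score)  # 50, up to floating-point error
sd(t_score)    # 10, up to floating-point error
```

A T score of 60 therefore means roughly one standard deviation above the mean of the comparison population.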

Data Cleaning

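#load the packages used throughout this report
library(tidyverse)    # read_csv, dplyr, tidyr, ggplot2
library(ggridges)     # geom_density_ridges
library(psych)        # describeBy
library(kableExtra)   # kbl, kable_classic
library(ggstatsplot)  # ggwithinstats
library(sjPlot)       # tab_model
library(interactions) # cat_plot

#retrieve data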
#els <- read_csv("./data/els_02_12_byf3pststu_v1_0.csv")
#select columns 
#els <- els %>% dplyr::select(STU_ID, BYSEX, BYRACE, BYMOTHED, BYTXMSTD, F1TXMSTD)
#save the revised (cleaned) data to csv
#write.csv(els,"./data/els_cleaned.csv", row.names = FALSE)

els <- read_csv("./data/els_cleaned.csv")

#replace missing data code to NA
els$BYSEX <- na_if(els$BYSEX, -4)
els$BYSEX <- na_if(els$BYSEX, -8)
els$BYRACE <- na_if(els$BYRACE, -4)
els$BYRACE <- na_if(els$BYRACE, -8)
els$BYMOTHED <- na_if(els$BYMOTHED, -4)
els$BYMOTHED <- na_if(els$BYMOTHED, -8)
els$BYMOTHED <- na_if(els$BYMOTHED, -9)
els$BYTXMSTD <- na_if(els$BYTXMSTD, -8)
els$F1TXMSTD <- na_if(els$F1TXMSTD, -8)
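The column-by-column replacements above could probably be collapsed into a single step. The following is a minimal sketch on made-up data, assuming it is acceptable to treat -4, -8, and -9 as missing in every column (slightly broader than the code above, which only maps -9 for BYMOTHED and only -8 for the scores).

```r
library(dplyr)

# Toy data standing in for the ELS extract (values are made up).
toy <- tibble(
  BYSEX    = c(1, 2, -4, -8),
  BYMOTHED = c(3, -9, 6, -4),
  BYTXMSTD = c(51.2, -8, 47.9, 60.3)
)

# Assumption: these three codes mean "missing" in all selected columns.
missing_codes <- c(-4, -8, -9)

# Replace every missing-data code with NA in one pass.
toy_clean <- toy %>%
  mutate(across(everything(), ~ replace(.x, .x %in% missing_codes, NA)))

toy_clean
```

Because the standardized scores are always positive, treating negative codes as missing should not remove any real score.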

#keep rows that have at least one math score (BY or F1); rows missing both scores are removed
els <- els %>% 
  filter(!is.na(BYTXMSTD) | !is.na(F1TXMSTD))
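Note that `|` keeps a row when at least one of the two scores is present, whereas `&` would keep only rows with both scores. A toy illustration (hypothetical data, not the ELS file):

```r
library(dplyr)

# Four made-up students: both scores, follow-up only, base only, neither.
toy <- tibble(
  STU_ID   = 1:4,
  BYTXMSTD = c(50, NA, 45, NA),
  F1TXMSTD = c(52, 48, NA, NA)
)

# Keeps students with at least one score (drops only student 4).
either <- toy %>% filter(!is.na(BYTXMSTD) | !is.na(F1TXMSTD))

# Keeps students with both scores (only student 1).
both <- toy %>% filter(!is.na(BYTXMSTD) & !is.na(F1TXMSTD))

nrow(either)  # 3
nrow(both)    # 1
```

The `|` version retains students who were tested at only one time point; they later drop out of the paired t-test but still contribute to the per-year summaries.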

#rename
els <- els %>%
  mutate(BYSEX = dplyr::recode(BYSEX, 
                          `1` = "Male", 
                          `2` = "Female"),
         BYRACE = dplyr::recode(BYRACE, 
                          `1` = "Native American/Alaskan", 
                          `2` = "Asian", 
                          `3` = "Black", 
                          `4` = "Hispanic (no race specified)", 
                          `5` = "Hispanic (specified)", 
                          `6` = "More than one race, non-Hispanic", 
                          `7` = "White, non-Hispanic"),
         BYMOTHED = dplyr::recode(BYMOTHED,
                          `1` = "Did not finish high school",
                          `2` = "Graduated from high school or GED",
                          `3` = "Attended 2-year school, no degree",
                          `4` = "Graduated from 2-year school",
                          `5` = "Attended college, no 4-year degree",
                          `6` = "Graduated from college",
                          `7` = "Completed Master's degree or equivalent",
                          `8` = "Completed PhD, MD, other advanced degree"))


#rename columns to use pivot_longer
colnames(els)[colnames(els) %in% c("BYTXMSTD", "F1TXMSTD")] <- c("Base", "Follow-up")

els_longer <- els %>% 
  pivot_longer(
    cols = c('Base', 'Follow-up'),
    names_to = "YEAR",
    values_to = "MATH"
  )

els_wider_by <- els %>% 
  pivot_wider(
    id_cols = !'Follow-up',
    names_from = BYRACE,
    values_from = c(Base)
  )

els_wider_f1 <- els %>% 
  pivot_wider(
    id_cols = !Base,
    names_from = BYRACE,
    values_from = c('Follow-up')
  )

Visualization 1

vis1Data <- els_longer %>% 
  mutate(YEAR = factor(YEAR,
                       levels = c("Follow-up",
                                 "Base"))) %>%
  filter(!is.na(BYSEX)) %>%
  ggplot(aes(x=MATH,y=YEAR,fill=YEAR)) +
    geom_col(position="dodge", show.legend = FALSE) +
    facet_wrap(~ BYSEX,ncol=1) +
    labs(x="Math Scores",
         y="Year",
         title="Student Math Scores",
         subtitle="by year and sex"
         ) +
    scale_fill_manual(values = c("maroon", "gold")) +
  theme_light()
vis1Data
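One caveat with the plot above: `geom_col()` draws one bar per row, so with thousands of rows per panel the bars overlap and the visible bar length appears to reflect the largest score rather than an average. A possible alternative, sketched on made-up data, is to summarise the mean per group first and plot that. The names `toy_long` and `toy_means` are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Toy long-format data with the same column names as els_longer (values made up).
toy_long <- tibble(
  BYSEX = rep(c("Male", "Female"), each = 4),
  YEAR  = rep(c("Base", "Follow-up"), times = 4),
  MATH  = c(48, 52, 55, NA, 50, 49, 53, 51)
)

# One row per sex and year, holding the mean score.
toy_means <- toy_long %>%
  filter(!is.na(BYSEX)) %>%
  group_by(BYSEX, YEAR) %>%
  summarize(mean_math = mean(MATH, na.rm = TRUE), .groups = "drop")

# One bar per group, so bar length equals the group mean.
ggplot(toy_means, aes(x = mean_math, y = YEAR, fill = YEAR)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ BYSEX, ncol = 1) +
  labs(x = "Mean math score", y = "Year")
```

The same pattern would apply to the race panels by grouping on BYRACE instead of BYSEX.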

Visualization 2

vis2Data <- els_longer %>% 
  mutate(YEAR = factor(YEAR,
                       levels = c("Follow-up",
                                  "Base"))) %>%
  filter(!is.na(BYRACE)) %>%
  ggplot(aes(x=MATH,y=YEAR,fill=YEAR)) +
    geom_col(position="dodge", show.legend = FALSE) +
    facet_wrap(~ BYRACE,ncol=1) +
    labs(x="Math Scores",
         y="Year",
         title="Student Math Scores",
         subtitle="by year and race"
         ) +
    scale_fill_manual(values = c("maroon", "gold")) +
  theme_light()
vis2Data

# Alternate graph combining Visualization 1 & 2? Maybe easier that way?

vis2DataAlternate <- els_longer %>% 
  mutate(YEAR = factor(YEAR,
                       levels = c("Follow-up",
                                  "Base"))) %>%
  mutate(BYSEX = factor(BYSEX,
                       levels = c("Male",
                                  "Female"))) %>%
  filter(!is.na(BYRACE)) %>%
  ggplot(aes(x=MATH,y=YEAR,fill=BYSEX)) +
    geom_col(position="dodge") +
    facet_wrap(~ BYRACE,ncol=1) +
    labs(x="Math Scores",
         y="Year",
         fill = "Sex",
         title="Student Math Scores",
         subtitle="by year and race, separated by sex"
         ) +
    scale_fill_manual(values = c("maroon", "gold"), breaks = c("Male", "Female")) +
  theme_light()
vis2DataAlternate

Visualization 3

Distribution plots

These are simple distribution plots of standardized math scores by race, sex, and mother’s education for the base year and the first follow-up.

# Fixed the names to look better.
els_viz <- els_longer %>%
  mutate(RACE = dplyr::recode(BYRACE, "Native American/Alaskan" = "Native American\n /Alaskan",
                         "Asian" = "Asian", 
                         "Black" = "Black", 
                         "Hispanic (no race specified)" = "Hispanic", 
                         "Hispanic (specified)" = "Hispanic\n (Race specified)", 
                         "More than one race, non-Hispanic" = "2+ races\n non-Hispanic",
                         "White, non-Hispanic" = "White\n non-Hispanic"),
         MOTHED = dplyr::recode(BYMOTHED, 
                           "Did not finish high school" = "Did not finish\n high school",
                           "Graduated from high school or GED" = "Graduated high\n school or GED",
                           "Attended 2-year school, no degree" = "Attended 2-year school\n no degree",
                           "Graduated from 2-year school" = "Graduated 2-year\n school",
                           "Attended college, no 4-year degree" = "Attended college\n no degree",
                           "Graduated from college" = "Graduated college",
                           "Completed Master's degree or equivalent" = "Master's degree",
                           "Completed PhD, MD, other advanced degree" = "PhD, MD,other\nadvanced degree")) %>%
  mutate(RACE = factor(RACE, levels = c("White\n non-Hispanic",
                                        "Black",
                                        "Hispanic",
                                        "Hispanic\n (Race specified)",
                                        "Asian",
                                        "Native American\n /Alaskan",
                                        "2+ races\n non-Hispanic")),
         MOTHED = factor(MOTHED, levels = c("Did not finish\n high school",
                                            "Graduated high\n school or GED",
                                            "Attended 2-year school\n no degree",
                                            "Graduated 2-year\n school",
                                            "Attended college\n no degree",
                                            "Graduated college",
                                            "Master's degree",
                                            "PhD, MD,other\nadvanced degree")))


# A plot of the distribution of math scores by race in year 1 and follow up  
els_viz %>%
  filter(!is.na(MATH) & !is.na(RACE)) %>%
  ggplot(aes(x = MATH)) +
  geom_histogram(col='black',fill='white')+
  theme_minimal() +
  xlab("Math Scores") +
  xlim(10,90)+
  facet_wrap( ~ RACE + YEAR, nrow = 2, ncol=7)+
  theme(strip.background =element_rect(fill="white"))

# A plot of the distribution of math scores by sex in year 1 and follow up.
els_viz %>%
  filter(!is.na(MATH) & !is.na(BYSEX)) %>%
  ggplot(aes(x = MATH)) +
  geom_histogram(col='black',fill='white')+
  theme_minimal() +
  xlab("Math Scores") +
  xlim(10,90)+
  facet_wrap( ~ BYSEX + YEAR, nrow = 1, ncol=4)+
  theme(strip.background =element_rect(fill="white"))

# A Plot of distribution of math scores by mother's education for year 1 and follow up. 
els_viz %>%
  filter(!is.na(MATH) & !is.na(MOTHED)) %>%
  ggplot(aes(x = MATH)) +
  geom_histogram(col='black',fill='white')+
  theme_minimal() +
  xlab("Math Scores") +
  xlim(10,90)+
  facet_wrap( ~ MOTHED + YEAR, nrow = 2, ncol=8)+
  theme(strip.background =element_rect(fill="white"))

# Honestly I don't love the visualization, so let's look at a box plot of the data.

# How about a density plot?

Boxplots

A boxplot of the standardized math scores by Race, Sex, and Mother’s education.

# Boxplot of math scores by Race, separated by Sex
els_viz %>%
  filter(!is.na(MATH) & !is.na(RACE)) %>%
  ggplot(aes(x= RACE, y=MATH)) +
  geom_boxplot(aes(fill = RACE), show.legend = FALSE)+
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_viridis_d(option = 'plasma')+
  theme_minimal()+
  facet_wrap(~BYSEX)+
  labs(x = "", 
       y = "Math Scores",
       title = "Math Score by Race",
       subtitle = "Separated by sex")+
  coord_flip()

# Boxplot of math scores by Mother's education separated by Year
els_viz %>%
  filter(!is.na(MATH) & !is.na(MOTHED)) %>%
  ggplot(aes(x= MOTHED, y=MATH)) +
  geom_boxplot(aes(fill = MOTHED), show.legend = FALSE)+
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  coord_flip()+
  scale_fill_viridis_d(option = 'mako')+
  theme_minimal()+
  facet_wrap(~YEAR)+
  labs(x = "", 
       y = "Math Scores",
       title = "Math Score by Mother Education and Year")

# Boxplot of math scores by sex separated by year
els_viz %>%
  filter(!is.na(MATH) & !is.na(BYSEX)) %>%
  ggplot(aes(x= BYSEX, y=MATH)) +
  geom_boxplot(aes(fill = BYSEX), show.legend = FALSE)+
  scale_fill_viridis_d()+
  theme_minimal()+
  facet_wrap(~YEAR)+
  labs(x = "", 
       y = "Math Scores",
       title = "Math Score by Sex and Year")

Density Plots

Finally let’s examine the data using density plots.

# Density plot of math scores by Race and Sex.
els_viz %>%
  filter(!is.na(MATH) & !is.na(RACE)) %>%
  ggplot(aes(x = MATH, y = RACE))+
  geom_density_ridges(aes(fill = RACE), alpha=0.5)+
  scale_fill_viridis_d(option = 'plasma')+
  theme_minimal()+
  theme(legend.position = "none")+
  facet_wrap(~BYSEX)

# Density plot of math scores by Mother education
els_viz %>%
  filter(!is.na(MATH) & !is.na(MOTHED)) %>%
  ggplot(aes(x = MATH, y = MOTHED))+
  geom_density_ridges(aes(fill = MOTHED), alpha=0.5)+
  scale_fill_viridis_d()+
  theme_minimal()+
  theme(legend.position = "none")+
  labs(x = "Math Score",
       y = "Mother's Education Level")

# Density plot of math scores by mother education separated by sex
els_viz %>%
  filter(!is.na(MATH) & !is.na(MOTHED)) %>%
  ggplot(aes(x = MATH, y = MOTHED))+
  geom_density_ridges(aes(fill = MOTHED), alpha=0.5)+
  scale_fill_viridis_d()+
  theme_minimal()+
  theme(legend.position = "none")+
  labs(x = "Math Score",
       y = "Mother's Education Level")+
  facet_wrap(~BYSEX, nrow = 1)

Descriptive Statistics

#Created table by Race 
By_Race <- els_longer %>%
  group_by(BYRACE) %>%
  summarize(race_n = n(), 
            mean_math = mean(MATH, na.rm = TRUE), 
           sd_math = sd(MATH, na.rm = TRUE))
#Use describeBy to get kurtosis and skew 
describeBy(els_longer$MATH, els_longer$BYRACE)
## 
##  Descriptive statistics by group 
## group: Asian
##    vars    n  mean    sd median trimmed   mad   min   max range skew kurtosis
## X1    1 2727 54.04 10.77  54.13   54.17 11.53 19.82 86.68 66.86 -0.1    -0.45
##      se
## X1 0.21
## ------------------------------------------------------------ 
## group: Black
##    vars    n  mean   sd median trimmed  mad   min   max range skew kurtosis
## X1    1 3627 44.34 8.47   44.1   44.19 8.67 19.94 76.32 56.38 0.17    -0.15
##      se
## X1 0.14
## ------------------------------------------------------------ 
## group: Hispanic (no race specified)
##    vars    n  mean   sd median trimmed  mad   min   max range skew kurtosis
## X1    1 1776 45.73 9.14  45.45   45.58 9.67 20.53 76.92 56.39 0.17    -0.26
##      se
## X1 0.22
## ------------------------------------------------------------ 
## group: Hispanic (specified)
##    vars    n  mean   sd median trimmed   mad   min   max range skew kurtosis
## X1    1 2182 45.91 9.94  45.69   45.76 10.33 21.96 75.94 53.98 0.15    -0.36
##      se
## X1 0.21
## ------------------------------------------------------------ 
## group: More than one race, non-Hispanic
##    vars    n  mean   sd median trimmed  mad   min   max range  skew kurtosis
## X1    1 1311 50.49 9.68   50.6   50.73 9.71 22.33 80.59 58.26 -0.19    -0.13
##      se
## X1 0.27
## ------------------------------------------------------------ 
## group: Native American/Alaskan
##    vars   n  mean   sd median trimmed  mad   min   max range skew kurtosis   se
## X1    1 234 45.64 8.18  45.48   45.56 8.65 24.07 72.75 48.68 0.16    -0.07 0.53
## ------------------------------------------------------------ 
## group: White, non-Hispanic
##    vars     n  mean   sd median trimmed  mad   min   max range  skew kurtosis
## X1    1 16229 52.92 9.25  53.27   53.17 9.41 19.38 82.63 63.25 -0.24    -0.14
##      se
## X1 0.07
By_Race %>%
  kbl(caption = "Math Descriptives by Race", 
      digits = 2, 
      col.names = c("Reported Race", "n", "Mean ", "SD")) %>%
  kable_classic()
Math Descriptives by Race
Reported Race                     n      Mean   SD
Asian                             2920   54.04  10.77
Black                             4040   44.34  8.47
Hispanic (no race specified)      1992   45.73  9.14
Hispanic (specified)              2442   45.91  9.94
More than one race, non-Hispanic  1470   50.49  9.68
Native American/Alaskan           260    45.64  8.18
White, non-Hispanic               17364  52.92  9.25
NA                                1804   49.52  9.48
#table for Mothers Education 
By_MotherED <- els_longer %>%
  group_by(BYMOTHED) %>%
  summarize(mothed_n = n(), 
            mean_math = mean(MATH, na.rm = TRUE), 
            sd_math = sd(MATH, na.rm = TRUE))

describeBy(els_longer$MATH, els_longer$BYMOTHED)
## 
##  Descriptive statistics by group 
## group: Attended 2-year school, no degree
##    vars    n  mean   sd median trimmed  mad   min   max range  skew kurtosis
## X1    1 3418 49.62 9.36  49.99   49.78 9.64 19.94 76.34  56.4 -0.15     -0.3
##      se
## X1 0.16
## ------------------------------------------------------------ 
## group: Attended college, no 4-year degree
##    vars    n  mean   sd median trimmed  mad   min   max range  skew kurtosis
## X1    1 2974 51.52 9.31  51.88   51.74 9.67 23.55 80.59 57.04 -0.19    -0.23
##      se
## X1 0.17
## ------------------------------------------------------------ 
## group: Completed Master's degree or equivalent
##    vars    n  mean  sd median trimmed  mad min   max range  skew kurtosis   se
## X1    1 2019 56.87 9.4  57.49   57.28 9.44  23 80.21 57.21 -0.46     0.25 0.21
## ------------------------------------------------------------ 
## group: Completed PhD, MD, other advanced degree
##    vars   n  mean    sd median trimmed   mad   min   max range  skew kurtosis
## X1    1 583 55.64 11.11  56.69   56.24 11.08 22.05 78.65  56.6 -0.48    -0.13
##      se
## X1 0.46
## ------------------------------------------------------------ 
## group: Did not finish high school
##    vars    n  mean   sd median trimmed  mad   min   max range skew kurtosis
## X1    1 3343 44.63 9.39  44.24   44.39 9.56 20.34 80.02 59.68 0.25    -0.17
##      se
## X1 0.16
## ------------------------------------------------------------ 
## group: Graduated from 2-year school
##    vars    n  mean   sd median trimmed  mad   min   max range  skew kurtosis
## X1    1 3014 51.08 9.31  51.25   51.23 9.42 21.54 86.68 65.14 -0.13    -0.19
##      se
## X1 0.17
## ------------------------------------------------------------ 
## group: Graduated from college
##    vars    n  mean   sd median trimmed  mad   min   max range  skew kurtosis
## X1    1 5327 54.63 9.45  55.22   54.94 9.46 23.37 84.85 61.48 -0.28    -0.13
##      se
## X1 0.13
## ------------------------------------------------------------ 
## group: Graduated from high school or GED
##    vars    n  mean   sd median trimmed  mad   min max range skew kurtosis   se
## X1    1 7460 48.69 9.44  48.74   48.69 9.84 19.38  84 64.62 0.01    -0.28 0.11
By_MotherED %>%
  kbl(caption = "Math Descriptives by Mother's Education Level", 
      digits = 2, 
      col.names = c("Mother's Education Level", "n", "Mean ", "SD")) %>%
  kable_classic()
Math Descriptives by Mother’s Education Level
Mother’s Education Level                  n     Mean   SD
Attended 2-year school, no degree         3696  49.62  9.36
Attended college, no 4-year degree        3178  51.52  9.31
Completed Master’s degree or equivalent   2120  56.87  9.40
Completed PhD, MD, other advanced degree  622   55.64  11.11
Did not finish high school                3862  44.63  9.39
Graduated from 2-year school              3240  51.08  9.31
Graduated from college                    5640  54.63  9.45
Graduated from high school or GED         8234  48.69  9.44
NA                                        1700  49.81  9.24
 #table for year            
By_Year <- els_longer %>%
  group_by(YEAR) %>%
  summarize(year_n = n(), 
            mean_math = mean(MATH, na.rm = TRUE), 
            sd_math = sd(MATH, na.rm = TRUE))

describeBy(els_longer$MATH, els_longer$YEAR)
## 
##  Descriptive statistics by group 
## group: Base
##    vars     n  mean   sd median trimmed   mad   min   max range skew kurtosis
## X1    1 15892 50.71 9.91  50.83   50.84 10.23 19.38 86.68  67.3 -0.1    -0.19
##      se
## X1 0.08
## ------------------------------------------------------------ 
## group: Follow-up
##    vars     n  mean    sd median trimmed   mad   min   max range  skew kurtosis
## X1    1 13648 50.66 10.11  50.85   50.74 10.85 19.82 79.85 60.03 -0.07    -0.51
##      se
## X1 0.09
By_Year %>%
  kbl(caption = "Math Descriptives by Year", 
      digits = 2, 
      col.names = c("Year", "n", "Mean ", "SD")) %>%
  kable_classic()
Math Descriptives by Year
Year       n      Mean   SD
Base       16146  50.71  9.91
Follow-up  16146  50.66  10.11
#Table by sex 
By_Sex <- els_longer %>%
  group_by(BYSEX) %>%
  summarize(sex_n = n(), 
            mean_math = mean(MATH, na.rm = TRUE), 
            sd_math = sd(MATH, na.rm = TRUE))

 describeBy(els_longer$MATH, els_longer$BYSEX)
## 
##  Descriptive statistics by group 
## group: Female
##    vars     n  mean   sd median trimmed  mad   min max range skew kurtosis   se
## X1    1 14203 50.11 9.64  50.37   50.23 10.2 19.82  84 64.18 -0.1    -0.37 0.08
## ------------------------------------------------------------ 
## group: Male
##    vars     n  mean    sd median trimmed   mad   min   max range skew kurtosis
## X1    1 13966 51.35 10.41  51.42   51.48 10.85 19.38 86.68  67.3 -0.1    -0.37
##      se
## X1 0.09
By_Sex %>%
  kbl(caption = "Math Descriptives by Sex", 
      digits = 2, 
      col.names = c("Reported Sex", "n", "Mean ", "SD")) %>%
  kable_classic()
Math Descriptives by Sex
Reported Sex  n      Mean   SD
Female        15400  50.11  9.64
Male          15254  51.35  10.41
NA            1638   49.96  9.09

More Statistics

ANOVA Results

ANOVA Math scores by Mother Education and Race

library(car)
math_mod <- lm(MATH ~ 1 + MOTHED*RACE, data = els_viz)
Anova(math_mod, type = 3)
## Anova Table (Type III tests)
## 
## Response: MATH
##              Sum Sq    Df    F value    Pr(>F)    
## (Intercept) 1845001     1 23299.2150 < 2.2e-16 ***
## MOTHED       149753     7   270.1601 < 2.2e-16 ***
## RACE          27333     6    57.5283 < 2.2e-16 ***
## MOTHED:RACE   15813    42     4.7546 < 2.2e-16 ***
## Residuals   2219619 28030                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The two-way Tukey HSD below produces too many pairwise comparisons to interpret, so the following sections look at one-way ANOVA comparisons instead
aovRaceMothed <- aov(MATH ~ RACE*MOTHED, data = els_viz)
tukey_all <- TukeyHSD(aovRaceMothed, conf.level = 0.95)

Post Hoc Tukey’s HSD: Race and Mother’s Education

Tukey’s HSD Race Pairs

aovRace <-  aov(MATH ~ RACE,data=els_viz)
tukey_race <- TukeyHSD(aovRace,conf.level = 0.95)

# Visualization of Tukeys HSD pairwise comparisons by race
plot(tukey_race, col = "brown")

# Let's look if Mother's education is a predictor. Did not finish high school is the reference group
#contrasts(els_viz$MOTHED)

Tukey’s HSD Mother’s Ed Pairs

aovMothed <-  aov(MATH ~ MOTHED,data=els_viz)
tukey_mothed <- TukeyHSD(aovMothed,conf.level = 0.95)

# Visualization of Tukey's HSD pairwise comparisons by mother's education
plot(tukey_mothed, col = "red")

# Let's look if Mother's education is a predictor. Did not finish high school is the reference group
#contrasts(els_viz$MOTHED)

Student’s t-test Analysis

Difference in scores between base year and first follow-up

#str(els_viz)
# Build a small wide data set (one row per student) holding the student id and the two scores.
# No grouping or NA filter is needed here: t.test() drops incomplete pairs itself.
els_byyear <- els_viz %>%
  dplyr::select(STU_ID, YEAR, MATH) %>%
  pivot_wider(names_from = YEAR,
              values_from = MATH) %>%
  rename("Follow" = "Follow-up")

# Paired student's t test to examine if the mean from the follow up is significantly greater than the base year mean
t.test(els_byyear$Follow, els_byyear$Base, paired = TRUE, alternative = "greater")
## 
##  Paired t-test
## 
## data:  els_byyear$Follow and els_byyear$Base
## t = -20.537, df = 13393, p-value = 1
## alternative hypothesis: true mean difference is greater than 0
## 95 percent confidence interval:
##  -0.9029677        Inf
## sample estimates:
## mean difference 
##      -0.8360042
# Paired student's t test to examine if means from base year and follow up are significantly different.
t.test(els_byyear$Base, els_byyear$Follow, paired = TRUE)
## 
##  Paired t-test
## 
## data:  els_byyear$Base and els_byyear$Follow
## t = 20.537, df = 13393, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.7562106 0.9157978
## sample estimates:
## mean difference 
##       0.8360042
# Visualization of the means by year. The t test indicates a significant difference between years even though the overall means are similar; contrary to what we hoped, the follow-up mean is significantly lower than the base-year mean. We could therefore focus on visualizing the data by race, mother's education, and sex rather than comparing base-year and follow-up scores.
ggwithinstats(data = els_viz, x = YEAR, y = MATH, 
              type = "parametric", 
              centrality.plotting = TRUE, 
              pairwise.display = "s", 
              point.path = FALSE,
              point.args = aes(size = 0, alpha = 0.2),
              results.subtitle = FALSE,
              alternative = "greater")
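To make the paired t-tests above more concrete, the statistic can be reproduced by hand as the mean of the within-student differences divided by its standard error. This is a sketch on made-up scores, not the ELS data.

```r
# Made-up paired scores for six students.
base   <- c(50, 47, 55, 61, 43, 52)
follow <- c(49, 48, 53, 60, 44, 50)

# Within-student differences (base minus follow-up).
d <- base - follow

# Paired t statistic: mean difference over its standard error.
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))

# The same statistic from t.test().
t_builtin <- unname(t.test(base, follow, paired = TRUE)$statistic)

all.equal(t_manual, t_builtin)  # TRUE
```

Swapping the order of the two vectors only flips the sign of the differences, which is why the two outputs above report t = -20.537 and t = 20.537.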

Regression Analysis

# First let us look if Race is a predictor. I will set white as the reference group since it is the largest group. 
#contrasts(els_viz$RACE)
# Looks like I set white as the reference earlier.  

mod_race <- lm(MATH ~ 1 + RACE, els_viz)
tab_model(mod_race)
MATH
Predictors                        Estimates  CI              p
(Intercept)                       52.92      52.78 – 53.07   <0.001
RACE [Black]                      -8.59      -8.93 – -8.25   <0.001
RACE [Hispanic]                   -7.19      -7.65 – -6.73   <0.001
RACE [Hispanic (Race specified)]  -7.02      -7.43 – -6.60   <0.001
RACE [Asian]                      1.11       0.73 – 1.49     <0.001
RACE [Native American /Alaskan]   -7.28      -8.49 – -6.07   <0.001
RACE [2+ races non-Hispanic]      -2.43      -2.96 – -1.91   <0.001
Observations                      28086
R2 / R2 adjusted                  0.127 / 0.126
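# This only tells us that each group differs significantly from the reference group (White non-Hispanic), not that all groups differ from each other
# To compare the other groups with each other, we could report pairwise comparisons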
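For reading the table above, it may help to recall how R's default treatment (dummy) coding works: the intercept is the mean of the reference group, and each coefficient is that group's mean minus the reference mean. A sketch on made-up data with hypothetical group labels:

```r
# Toy data; "White" is the first factor level and therefore the reference group.
toy <- data.frame(
  RACE = factor(c("White", "White", "White", "Black", "Black", "Asian", "Asian"),
                levels = c("White", "Black", "Asian")),
  MATH = c(52, 54, 53, 44, 46, 55, 53)
)

fit <- lm(MATH ~ RACE, data = toy)

# Group means for comparison: White 53, Black 45, Asian 54.
group_means <- tapply(toy$MATH, toy$RACE, mean)

coef(fit)["(Intercept)"]  # reference-group mean
coef(fit)["RACEBlack"]    # Black mean minus White mean
coef(fit)["RACEAsian"]    # Asian mean minus White mean
```

This matches the report's numbers: the intercept of 52.92 equals the White, non-Hispanic mean in the descriptive table, and the Black coefficient of -8.59 agrees with 44.34 - 52.92 up to rounding.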

race_pairs <- pairwise.t.test(els_viz$MATH, els_viz$RACE, p.adjust.method = "bonf")
race_pval <- race_pairs$p.value %>%
  round(digits = 3)

options(knitr.kable.NA = "")
race_pval %>%
  kbl(caption = "p-values of Math Score by Race",
      digits = 3) %>%
  kable_classic()
p-values of Math Score by Race
Columns: (1) White non-Hispanic, (2) Black, (3) Hispanic, (4) Hispanic (Race specified), (5) Asian, (6) Native American /Alaskan
                           (1)    (2)    (3)    (4)    (5)    (6)
Black                      0.000
Hispanic                   0.000  0.000
Hispanic (Race specified)  0.000  0.000  1.000
Asian                      0.000  0.000  0.000  0.000
Native American /Alaskan   0.000  0.808  1.000  1.000  0.000
2+ races non-Hispanic      0.000  0.000  0.000  0.000  0.000  0.000
# Based on this, all pairs of groups differ significantly except Black vs. Native American/Alaskan, Hispanic vs. Hispanic (Race specified), and Native American/Alaskan vs. either Hispanic group
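The p-values of exactly 1 in the table are a consequence of the Bonferroni adjustment, which multiplies each raw p-value by the number of comparisons and caps the result at 1. As far as I understand, `pairwise.t.test()` passes the pairwise p-values to `p.adjust()`; with 7 race groups that is choose(7, 2) = 21 comparisons. The raw p-values below are made up for illustration.

```r
# Hypothetical raw (unadjusted) p-values for three of the pairs.
raw_p <- c(0.0001, 0.02, 0.30)

# Number of pairwise comparisons among 7 groups.
m <- choose(7, 2)  # 21

# Bonferroni by hand: multiply by m and cap at 1.
manual <- pmin(1, raw_p * m)

# The same adjustment through p.adjust().
builtin <- p.adjust(raw_p, method = "bonferroni", n = m)

manual                      # 0.0021 0.4200 1.0000
all.equal(manual, builtin)  # TRUE
```

An adjusted value of 1 therefore does not mean the two groups are identical, only that the raw p-value was larger than 1/21.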

mod_mothed <- lm(MATH ~ 1 + MOTHED, els_viz)
tab_model(mod_mothed)
MATH
Predictors                                 Estimates  CI              p
(Intercept)                                44.63      44.31 – 44.95   <0.001
MOTHED [Graduated high school or GED]      4.07       3.68 – 4.45     <0.001
MOTHED [Attended 2-year school no degree]  4.99       4.54 – 5.44     <0.001
MOTHED [Graduated 2-year school]           6.45       5.98 – 6.91     <0.001
MOTHED [Attended college no degree]        6.89       6.43 – 7.36     <0.001
MOTHED [Graduated college]                 10.01      9.60 – 10.41    <0.001
MOTHED [Master’s degree]                   12.25      11.72 – 12.77   <0.001
MOTHED [PhD, MD,other advanced degree]     11.01      10.18 – 11.84   <0.001
Observations                               28138
R2 / R2 adjusted                           0.118 / 0.117
# This only tells us that each group differs significantly from the reference group (Did not finish high school)
# To compare the other groups with each other, we could report pairwise comparisons

mothed_pairs <- pairwise.t.test(els_viz$MATH, els_viz$MOTHED, p.adjust.method = "bonf")
mothed_pval <- mothed_pairs$p.value %>%
  round(digits = 3)

options(knitr.kable.NA = "")
mothed_pval %>%
  kbl(caption = "p-values of Math Score by Mother's Education level",
      digits = 3) %>%
  kable_classic()
p-values of Math Score by Mother’s Education level
Columns: (1) Did not finish high school, (2) Graduated high school or GED, (3) Attended 2-year school no degree, (4) Graduated 2-year school, (5) Attended college no degree, (6) Graduated college, (7) Master’s degree
                                  (1)    (2)    (3)    (4)    (5)    (6)    (7)
Graduated high school or GED      0.000
Attended 2-year school no degree  0.000  0.000
Graduated 2-year school           0.000  0.000  0.000
Attended college no degree        0.000  0.000  0.000  1.000
Graduated college                 0.000  0.000  0.000  0.000  0.000
Master’s degree                   0.000  0.000  0.000  0.000  0.000  0.000
PhD, MD,other advanced degree     0.000  0.000  0.000  0.000  0.000  0.409  0.151

Mother Education / Race Regression

A regression on all three variables is not attempted here. Instead, the model below regresses math scores on race/ethnicity, mother’s education, and their interaction.

mod_moth_race <- lm(MATH ~ 1 + MOTHED*RACE, data = els_viz)
#summary(mod_moth_race)

tab_model(mod_moth_race)
MATH
Predictors                                                                    Estimates  CI              p
(Intercept)                                                                   46.40      45.80 – 46.99   <0.001
MOTHED [Graduated high school or GED]                                         3.77       3.12 – 4.42     <0.001
MOTHED [Attended 2-year school no degree]                                     5.48       4.77 – 6.18     <0.001
MOTHED [Graduated 2-year school]                                              6.37       5.65 – 7.09     <0.001
MOTHED [Attended college no degree]                                           7.28       6.55 – 8.01     <0.001
MOTHED [Graduated college]                                                    9.47       8.80 – 10.13    <0.001
MOTHED [Master’s degree]                                                      11.57      10.81 – 12.32   <0.001
MOTHED [PhD, MD,other advanced degree]                                        12.11      10.99 – 13.22   <0.001
RACE [Black]                                                                  -5.37      -6.37 – -4.36   <0.001
RACE [Hispanic]                                                               -2.98      -3.88 – -2.08   <0.001
RACE [Hispanic (Race specified)]                                              -4.74      -5.67 – -3.81   <0.001
RACE [Asian]                                                                  2.70       1.78 – 3.62     <0.001
RACE [Native American /Alaskan]                                               -2.66      -5.71 – 0.39    0.087
RACE [2+ races non-Hispanic]                                                  -1.15      -3.03 – 0.74    0.233
MOTHED [Graduated high school or GED] × RACE [Black]                          -2.16      -3.34 – -0.99   <0.001
MOTHED [Attended 2-year school no degree] × RACE [Black]                      -2.70      -4.02 – -1.39   <0.001
MOTHED [Graduated 2-year school] × RACE [Black]                               -2.71      -4.06 – -1.35   <0.001
MOTHED [Attended college no degree] × RACE [Black]                            -2.48      -3.84 – -1.12   <0.001
MOTHED [Graduated college] × RACE [Black]                                     -3.00      -4.30 – -1.70   <0.001
MOTHED [Master’s degree] × RACE [Black]                                       -1.76      -3.53 – 0.01    0.052
MOTHED [PhD, MD,other advanced degree] × RACE [Black]                         -8.36      -10.97 – -5.75  <0.001
MOTHED [Graduated high school or GED] × RACE [Hispanic]                       -2.27      -3.56 – -0.98   0.001
MOTHED [Attended 2-year school no degree] × RACE [Hispanic]                   -2.70      -4.25 – -1.15   0.001
MOTHED [Graduated 2-year school] × RACE [Hispanic]                            -2.06      -3.83 – -0.30   0.022
MOTHED [Attended college no degree] × RACE [Hispanic]                         -1.93      -3.61 – -0.25   0.024
MOTHED [Graduated college] × RACE [Hispanic]                                  -4.12      -5.93 – -2.30   <0.001
MOTHED [Master’s degree] × RACE [Hispanic]                                    -0.98      -3.72 – 1.75    0.481
MOTHED [PhD, MD,other advanced degree] × RACE [Hispanic]                      -2.82      -6.36 – 0.73    0.119
MOTHED [Graduated high school or GED] × RACE [Hispanic (Race specified)]      -0.01      -1.22 – 1.21    0.991
MOTHED [Attended 2-year school no degree] × RACE [Hispanic (Race specified)]  -0.54      -2.04 – 0.97    0.484
MOTHED [Graduated 2-year school] × RACE [Hispanic (Race specified)]           0.07       -1.63 – 1.77    0.935
MOTHED [Attended college no degree] × RACE [Hispanic (Race specified)]        -1.74      -3.32 – -0.16   0.031
MOTHED [Graduated college] × RACE [Hispanic (Race specified)]                 -0.28      -1.73 – 1.16    0.701
MOTHED [Master’s degree] × RACE [Hispanic (Race specified)]                   -2.37      -4.39 – -0.36   0.021
MOTHED [PhD, MD,other advanced degree] × RACE [Hispanic (Race specified)]     -2.86      -6.13 – 0.41    0.087
MOTHED [Graduated high school or GED] × RACE [Asian]                          1.65       0.45 – 2.85     0.007
MOTHED [Attended 2-year school no degree] × RACE [Asian]                      -3.12      -4.74 – -1.51   <0.001
MOTHED [Graduated 2-year school] × RACE [Asian]                               -0.19      -1.81 – 1.43    0.818
MOTHED [Attended college no degree] × RACE [Asian]                            -3.39      -5.03 – -1.76   <0.001
MOTHED [Graduated college] × RACE [Asian]                                     -1.65      -2.83 – -0.47   0.006
MOTHED [Master’s degree] × RACE [Asian]                                       -1.67      -3.26 – -0.08   0.040
MOTHED [PhD, MD,other advanced degree] × RACE [Asian]                         -4.36      -6.66 – -2.05   <0.001
MOTHED [Graduated high school or GED] × RACE [Native American /Alaskan]       -1.15      -4.83 – 2.53    0.540
MOTHED [Attended 2-year school no degree] × RACE [Native American /Alaskan]   -7.58      -12.71 – -2.44  0.004
MOTHED [Graduated 2-year school] × RACE [Native American /Alaskan]            -5.15      -10.04 – -0.26  0.039
MOTHED [Attended college no degree] × RACE [Native American /Alaskan]         -6.51      -10.67 – -2.36  0.002
MOTHED [Graduated college] × RACE [Native American /Alaskan]                  -5.71      -9.94 – -1.49   0.008
MOTHED [Master’s degree] × RACE [Native American /Alaskan]                    -5.88      -11.98 – 0.22   0.059
MOTHED [PhD, MD,other advanced degree] × RACE [Native American /Alaskan]      7.61       -5.13 – 20.35   0.242
MOTHED [Graduated high school or GED] × RACE [2+ races non-Hispanic]          -0.98      -3.09 – 1.14    0.364
MOTHED [Attended 2-year school no degree] × RACE [2+ races non-Hispanic]      -3.18      -5.54 – -0.82   0.008
MOTHED [Graduated 2-year school] × RACE [2+ races non-Hispanic]               -0.07      -2.53 – 2.39    0.958
MOTHED [Attended college no degree] × RACE [2+ races non-Hispanic]            -0.23      -2.55 – 2.09    0.846
MOTHED [Graduated college] × RACE [2+ races non-Hispanic]                     -0.56      -2.76 – 1.64    0.620
MOTHED [Master’s degree] × RACE [2+ races non-Hispanic]                       -1.38      -4.00 – 1.24    0.302
MOTHED [PhD, MD,other advanced degree] × RACE [2+ races non-Hispanic]         -8.16      -12.03 – -4.30  <0.001
Observations                                                                  28086
R2 / R2 adjusted                                                              0.214 / 0.213
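Reading an interaction model like this one takes some arithmetic: the predicted mean for a given cell is the intercept plus the relevant MOTHED coefficient, plus the relevant RACE coefficient, plus their interaction term. The sketch below simply adds up estimates copied from the table above; the variable names are hypothetical, and the reference cell is assumed to be White non-Hispanic students whose mothers did not finish high school.

```r
# Estimates copied from the regression table (rounded to two decimals there).
intercept     <- 46.40  # reference cell: White non-Hispanic, mother did not finish high school
b_mothed_hs   <- 3.77   # MOTHED [Graduated high school or GED]
b_race_black  <- -5.37  # RACE [Black]
b_interaction <- -2.16  # MOTHED [Graduated high school or GED] x RACE [Black]

# Predicted mean for White non-Hispanic students whose mothers graduated high school.
pred_white_hs <- intercept + b_mothed_hs

# Predicted mean for Black students whose mothers graduated high school.
pred_black_hs <- intercept + b_mothed_hs + b_race_black + b_interaction

pred_white_hs  # 50.17
pred_black_hs  # 42.64
```

Because the table reports rounded estimates, sums computed this way can differ from the exact fitted values in the second decimal.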

Mother Edu / Race Regression Visualization

Now let’s try to visualize this; it might get messy.

# Visualization of regression models for mother's education and race
#install.packages("interactions")
cat_plot(mod_moth_race, pred = MOTHED, modx = RACE, geom = "line", interval = FALSE,vary.lty = TRUE)

Mother ED / Race Bar graph

This allows for the error bars to be included but crowds the data and makes analyzing the data across groups more difficult.

cat_plot(mod_moth_race, pred = MOTHED, modx = RACE, geom = "bar")